Unit 6: Text Data Processing 


Table 55: TDP Terms 


Term | Definition | Example
Text sources | Electronic text content in HTML, TXT, or XML that can be accessed by Data Services and analyzed with Text Data Processing. | Comments fields, surveys, notes
Entity | Names of people, places, things, and values that can be extracted from text. Defined as a pairing of a standard form and its type. | Steven Paul Jobs is a Person
Sub-entity | An embedded entity within the containing entity. Has a prefix that matches that of the larger entity. | February 24, 1955
Subtype | A hierarchical specification of an entity type, distinguishing between different semantic varieties of the same type. | Airplane - air vehicle
Relationship, event, fact | Connects two or more entities to each other and can be extracted from unstructured text using (custom) extraction rules. | I love Acme bread!
Sub-fact (OD markers) | Internal unit within a fact that is itself an embedded entity. | The crust is perfect.
Sentiment | Statement of positive or negative emotion or feeling about something. Detected using extraction rules. | Pain au levain is the best bread ever.
Dictionary | A file that contains a user-defined set of entities, each specifying the standard form, variant forms, entity type, and so on. | ProdDictionary.nc
Rule | File defining patterns for the extraction of entities or facts. | english-tf-voc-sentiment.rul
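
The Entity and Dictionary terms can be pictured as a small data structure. The following Python sketch is illustrative only: it is not the Data Services dictionary file format or API. The class names, the `ORGANIZATION` type label, the `Acme Bakery` variant, and the naive substring matching are all assumptions made for the example. It shows an entity as a pairing of a standard form and a type, and a dictionary entry that maps variant forms back to that standard form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """An entity: a standard form paired with its type."""
    standard_form: str
    entity_type: str


@dataclass(frozen=True)
class DictionaryEntry:
    """One user-defined entry (hypothetical in-memory layout)."""
    entity: Entity
    variants: tuple = ()


def build_index(entries):
    """Map every lower-cased standard or variant form to its entity."""
    index = {}
    for entry in entries:
        for form in (entry.entity.standard_form, *entry.variants):
            index[form.lower()] = entry.entity
    return index


def extract(text, index):
    """Return entities whose forms occur in the text, longest form first.

    Naive substring matching, for illustration only; a real extractor
    works on tokens and linguistic analysis.
    """
    lowered = text.lower()
    found = []
    for form in sorted(index, key=len, reverse=True):
        if form in lowered and index[form] not in found:
            found.append(index[form])
    return found


entries = [
    DictionaryEntry(Entity("Steven Paul Jobs", "PERSON"), ("Steve Jobs",)),
    # Assumed entry, loosely based on the "Acme bread" example in the table.
    DictionaryEntry(Entity("Acme", "ORGANIZATION"), ("Acme Bakery",)),
]
index = build_index(entries)
matches = extract("Steve Jobs was born on February 24, 1955. I love Acme bread!", index)
for entity in matches:
    print(entity.standard_form, "is a", entity.entity_type)
```

The variant "Steve Jobs" resolves to the standard form "Steven Paul Jobs", which is the practical reason a dictionary stores variant forms alongside the standard form and type.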


Text Data Processing Architecture 
The following figure shows the architecture for text data processing. 